Lesson 4


Scatterplots and Perceived Audience Size

Notes:


Scatterplots

Notes: Follow along to see what the ggplot syntax consists of.

library(ggplot2)
pf <- read.delim('pseudo_facebook.tsv')
ggplot(aes(x = age, y = friend_count), data = pf) +
  geom_point()


What are some things that you notice right away?

Response: Most of the users with large friend counts are in their 20s or are listed with ages of 100 and above.


ggplot Syntax

Notes: The difference between qplot and ggplot is that ggplot lets you build more complex plots, but you have to specify the geom explicitly and wrap the variable mappings in aes().

qplot(x= age, y= friend_count, data= pf)

ggplot(aes(x= age, y=friend_count), data= pf) + geom_point()

summary(pf$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.00   20.00   28.00   37.28   50.00  113.00

Overplotting

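Notes: alpha helps with overplotting: with alpha = 1/20, it takes 20 overlapping points to draw the equivalent of one solid point. geom_jitter adds random noise to each point, which is useful when a variable such as age is recorded as discrete integer values rather than as continuous values.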

ggplot(aes(x= age, y=friend_count), data= pf) + geom_jitter(alpha = 1/20) + xlim(13,90)
## Warning: Removed 5172 rows containing missing values (geom_point).
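
To see why jitter helps with integer ages, here is a minimal base R sketch on made-up values. The `ages` vector and the 0.4 spread are illustrative assumptions, not taken from the dataset. It adds uniform horizontal noise to each age, which is roughly what a jitter layer does before drawing the points.

```r
# Hypothetical integer ages; several users share each value,
# so their points would be drawn on top of one another.
ages <- c(20, 20, 20, 21, 21, 22)

set.seed(1)  # make the random noise reproducible
# Shift each age by a random amount in [-0.4, 0.4].
jittered_ages <- ages + runif(length(ages), min = -0.4, max = 0.4)

# The noise is smaller than half a year, so rounding recovers the original ages.
recovered <- round(jittered_ages)
print(jittered_ages)
print(identical(recovered, ages))
```

Because the noise stays below half a unit, the points spread out visually without being mistaken for a neighbouring age.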

What do you notice in the plot?

Response: Fewer people have a high friend count.


Coord_trans()

Notes: We change geom_jitter to geom_point and add coord_trans(y = 'sqrt') at the end. We use geom_point instead of plain jitter because jitter also adds vertical noise, which can push a friend count of 0 below zero, and the square root of a negative number is undefined. Passing position = position_jitter(h = 0) to geom_point jitters only the ages and leaves the friend counts untouched, so we do not get any negative numbers.

?coord_trans

Look up the documentation for coord_trans() and add a layer to the plot that transforms friend_count using the square root function. Create your plot!

ggplot(aes(x= age, y=friend_count), data= pf) + geom_point(alpha = 1/20, position = position_jitter(h=0)) + xlim(13,90) + 
  coord_trans(y= 'sqrt')
## Warning: Removed 5190 rows containing missing values (geom_point).
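
The reason for `position_jitter(h = 0)` can be checked directly in base R. The sketch below uses made-up friend counts (an assumption, not the real data): vertical noise can move a count of 0 below zero, where the square root is not a number.

```r
friend_counts <- c(0, 1, 4, 100, 2500)   # hypothetical values

# The square root compresses large counts far more than small ones.
sqrt_counts <- sqrt(friend_counts)
print(sqrt_counts)

# Simulate vertical jitter that happens to move a count of 0 downward.
jittered_zero <- 0 - 0.3
# sqrt() of a negative number is NaN (R also emits a warning, silenced here).
bad_value <- suppressWarnings(sqrt(jittered_zero))
print(is.nan(bad_value))
```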

What do you notice?

Response: Many people around age 70 have a friend count below 1000.

Alpha and Jitter

Notes: Explore the relationship between friendships_initiated and age

names(pf)
##  [1] "userid"                "age"                  
##  [3] "dob_day"               "dob_year"             
##  [5] "dob_month"             "gender"               
##  [7] "tenure"                "friend_count"         
##  [9] "friendships_initiated" "likes"                
## [11] "likes_received"        "mobile_likes"         
## [13] "mobile_likes_received" "www_likes"            
## [15] "www_likes_received"
ggplot(aes(x= age, y=friendships_initiated), data= pf) + geom_point(alpha = 1/10, position = position_jitter(h=0)) + xlim(13,90) + coord_trans(y= 'sqrt')
## Warning: Removed 5191 rows containing missing values (geom_point).


Overplotting and Domain Knowledge

Notes: Many people underestimate how many people see their Facebook posts. The graph shows perceived audience size against actual audience size (as a percentage).

Conditional Means

Notes: Important Notice! Please note that in newer versions of dplyr (0.3.x+), the syntax %.% has been deprecated and replaced with %>%.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
pf.fc_by_age <- pf %>%
  group_by(age) %>%
  summarise(friend_count_mean = mean(friend_count),
            friend_count_median = median(friend_count),
            n = n()) %>%
  arrange(age)

head(pf.fc_by_age, 20)
## # A tibble: 20 x 4
##      age friend_count_mean friend_count_median     n
##    <int>             <dbl>               <dbl> <int>
##  1    13              165.                74.0   484
##  2    14              251.               132.   1925
##  3    15              348.               161.   2618
##  4    16              352.               172.   3086
##  5    17              350.               156.   3283
##  6    18              331.               162.   5196
##  7    19              334.               157.   4391
##  8    20              283.               135.   3769
##  9    21              236.               121.   3671
## 10    22              211.               106.   3032
## 11    23              203.                93.0  4404
## 12    24              186.                92.0  2827
## 13    25              131.                62.0  3641
## 14    26              144.                75.0  2815
## 15    27              134.                72.0  2240
## 16    28              126.                66.0  2364
## 17    29              121.                66.0  1936
## 18    30              115.                67.5  1716
## 19    31              118.                63.0  1694
## 20    32              114.                63.0  1443
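
As a cross-check on what group_by() and summarise() compute, the same conditional summaries can be sketched in base R with tapply(). The `users` data frame below is a hypothetical stand-in for pf, small enough to verify by hand.

```r
# Hypothetical mini data set: one row per user.
users <- data.frame(
  age          = c(13, 13, 13, 14, 14, 15),
  friend_count = c(10, 20, 300, 50, 150, 80)
)

# Split friend_count by age and summarise each group.
fc_mean   <- tapply(users$friend_count, users$age, mean)
fc_median <- tapply(users$friend_count, users$age, median)
n_users   <- tapply(users$friend_count, users$age, length)

# Assemble one row per age, ordered by age.
fc_by_age <- data.frame(
  age                 = as.numeric(names(fc_mean)),
  friend_count_mean   = as.vector(fc_mean),
  friend_count_median = as.vector(fc_median),
  n                   = as.vector(n_users)
)
print(fc_by_age)
```

For age 13 the mean (110) is far above the median (20) because of the single count of 300, the same pattern visible in the real table.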

Create your plot!

ggplot(aes(x=age, y=friend_count_mean), data= pf.fc_by_age) + geom_line()


Overlaying Summaries with Raw Data

Notes: ggplot2 2.0.0 changes the syntax for passing parameters to functions when using stat = 'summary'. To set parameters on the function specified by fun.y, use the fun.args argument, e.g.: ggplot( … ) + geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = .9), … ). To zoom in, the code should use the coord_cartesian(xlim = c(13, 90)) layer rather than the xlim(13, 90) layer.

ggplot(aes(x=age, y=friend_count), data = pf) + 
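  geom_point(alpha=0.05, 
             position = position_jitter(h=0),
             color = 'orange') +
  # A plot has only one coordinate system: a later coord_*() call replaces an
  # earlier one, so the zoom is set here instead of in a separate coord_cartesian().
  # (Older ggplot2 releases name this argument limx instead of xlim.)
  coord_trans(y= 'sqrt', xlim = c(13, 90)) +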
  geom_line(stat= 'summary', fun.y=mean) +
  geom_line(stat= 'summary', fun.y=quantile, fun.args = list(probs = .1), linetype = 2, color = 'blue') +
  geom_line(stat= 'summary', fun.y=quantile, fun.args = list(probs = .9), linetype = 2, color = 'blue') +
  geom_line(stat= 'summary', fun.y=quantile, fun.args = list(probs = .5), color = 'blue')

What are some of your observations of the plot?

Response: The median (50%) line is slightly below the mean line. This is probably because the mean is pulled in one direction by outliers, whereas the median is not affected by how extreme the outliers are.
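
A small base R sketch illustrates the gap between the two lines. The friend counts below are made up for a single hypothetical age: one very large value raises the mean but leaves the median alone.

```r
# Hypothetical friend counts for one age: mostly small, one very large.
friend_counts <- c(10, 20, 30, 40, 900)

mean_fc   <- mean(friend_counts)     # pulled upward by the value 900
median_fc <- median(friend_counts)   # middle value, unaffected by how large 900 is

# quantile() with probs = 0.5 is the same summary the 50% line uses.
q50 <- unname(quantile(friend_counts, probs = 0.5))

print(c(mean = mean_fc, median = median_fc, q50 = q50))
```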


Moira: Histogram Summary and Scatterplot

See the Instructor Notes of this video to download Moira’s paper on perceived audience size and to see the final plot.

Notes: People don't underestimate as badly when asked, "How many people do you think saw your posts over the whole month?"


Correlation

Notes:

cor.test(x=pf$age, y=pf$friend_count, method= 'pearson')
## 
##  Pearson's product-moment correlation
## 
## data:  pf$age and pf$friend_count
## t = -8.6268, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03363072 -0.02118189
## sample estimates:
##         cor 
## -0.02740737

Look up the documentation for the cor.test function.

What’s the correlation between age and friend count? Round to three decimal places. Response: -0.027
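
For reference, the rounded coefficient can be pulled out of the cor.test() result programmatically. This is a minimal sketch on made-up vectors `x` and `y` (assumptions, not columns of pf); it also recomputes Pearson's r by hand as covariance divided by the product of the standard deviations.

```r
# Hypothetical paired measurements.
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2, 1, 4, 3, 6, 5)

result <- cor.test(x, y, method = "pearson")

# The estimate is stored as a named number; drop the name and round it.
r <- round(unname(result$estimate), 3)

# Pearson's r by hand: covariance over the product of standard deviations.
r_manual <- cov(x, y) / (sd(x) * sd(y))
print(r)
print(all.equal(r, round(r_manual, 3)))
```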


Correlation on Subsets

Notes:

with(subset(pf, age<=70), cor.test(age, friend_count))
## 
##  Pearson's product-moment correlation
## 
## data:  age and friend_count
## t = -52.592, df = 91029, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1780220 -0.1654129
## sample estimates:
##        cor 
## -0.1717245
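
The jump in the coefficient after subsetting can be reproduced on a toy example. The data frame below is hypothetical: friend count falls steadily with age up to 70, and two implausibly old accounts report high counts, which weakens the overall linear relationship.

```r
# Hypothetical data: a clean downward trend up to age 70,
# plus two very old accounts with high friend counts.
toy <- data.frame(
  age          = c(20, 30, 40, 50, 60, 70, 100, 110),
  friend_count = c(600, 500, 400, 300, 200, 100, 550, 650)
)

# Correlation on all rows versus on the subset with age <= 70.
r_all    <- with(toy, cor(age, friend_count))
r_subset <- with(subset(toy, age <= 70), cor(age, friend_count))

print(c(all = r_all, age_70_or_less = r_subset))
```

The subset correlation is much stronger than the full-data one, which mirrors the move from -0.027 to -0.172 above.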

Create Scatterplots

Notes:

ggplot(aes(x=www_likes_received, y= likes_received), data = pf) + geom_point() +
  scale_x_continuous(limits = c(0,25000)) + scale_y_continuous(limits = c(0, 30000))
## Warning: Removed 12 rows containing missing values (geom_point).


Strong Correlations

Notes: ‘lm’ stands for linear model

ggplot(aes(x=www_likes_received, y= likes_received), data = pf) + geom_point() +
  xlim(0, quantile(pf$www_likes_received, 0.95)) +
  ylim(0, quantile(pf$likes_received, 0.95)) +
  geom_smooth(method = 'lm', color= 'red')
## Warning: Removed 6075 rows containing non-finite values (stat_smooth).
## Warning: Removed 6075 rows containing missing values (geom_point).
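
The line drawn by method = 'lm' is an ordinary least-squares fit. As a rough sketch, the same kind of fit can be made directly with lm() on made-up data (the vectors `x` and `y` are assumptions). For a single predictor, the slope equals the correlation times sd(y) / sd(x), which ties the fitted line to the correlation asked about next.

```r
# Hypothetical data lying close to the line y = 1 + 2 * x.
x <- c(1, 2, 3, 4, 5)
y <- c(3.1, 4.9, 7.0, 9.1, 10.9)

fit <- lm(y ~ x)      # fit a linear model: y = intercept + slope * x
slope <- unname(coef(fit)["x"])
intercept <- unname(coef(fit)["(Intercept)"])

# For a one-predictor model, slope = r * sd(y) / sd(x).
slope_from_r <- cor(x, y) * sd(y) / sd(x)
print(c(intercept = intercept, slope = slope))
print(all.equal(slope, slope_from_r))
```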

What’s the correlation between the two variables? Include the top 5% of values for the variable in the calculation and round to 3 decimal places.

cor.test(x=pf$www_likes_received, y=pf$likes_received, method= 'pearson')
## 
##  Pearson's product-moment correlation
## 
## data:  pf$www_likes_received and pf$likes_received
## t = 937.1, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9473553 0.9486176
## sample estimates:
##       cor 
## 0.9479902

Response: 0.948

Moira on Correlation

Notes:


More Caution with Correlation

Notes:

library(alr3)
## Loading required package: car
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
summary(Mitchell)
##      Month             Temp        
##  Min.   :  0.00   Min.   :-7.4778  
##  1st Qu.: 50.75   1st Qu.:-0.3486  
##  Median :101.50   Median :10.4500  
##  Mean   :101.50   Mean   :10.3125  
##  3rd Qu.:152.25   3rd Qu.:20.4306  
##  Max.   :203.00   Max.   :27.6056

Create your plot!

ggplot(aes(x=Month, y=Temp), data= Mitchell) + geom_point()


Noisy Scatterplots

  1. Take a guess for the correlation coefficient for the scatterplot. 0

  2. What is the actual correlation of the two variables? (Round to the thousandths place) 0.057

cor.test(x=Mitchell$Month, y=Mitchell$Temp)
## 
##  Pearson's product-moment correlation
## 
## data:  Mitchell$Month and Mitchell$Temp
## t = 0.81816, df = 202, p-value = 0.4142
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.08053637  0.19331562
## sample estimates:
##        cor 
## 0.05747063

Making Sense of Data

Notes: Break up the x-axis so that every 12 months corresponds to a year. What layer would you add to your existing code to do this?

ggplot(aes(x=Month, y=Temp), data= Mitchell) + geom_point() + scale_x_continuous(breaks = seq(0, 203, 12))


A New Perspective

What do you notice? Response: The graph is cyclical, similar to a sinusoid.

Watch the solution video and check out the Instructor Notes!

Notes:

ggplot(aes(x=(Month%%12),y=Temp), data=Mitchell)+ geom_point()
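
Why a near-zero correlation can hide a strong pattern is easy to check on synthetic data. The sketch below assumes a pure sine-wave temperature over 204 months (the same number of months as Mitchell, but not the real measurements) and folds the months onto a single year with %%.

```r
# Hypothetical monthly temperatures: a pure yearly cycle over 17 years.
month <- 0:203
temp  <- 10 + 15 * sin(2 * pi * month / 12)

# Across the whole series the linear correlation is close to zero ...
r_linear <- cor(month, temp)

# ... yet folding the months onto a single year exposes the pattern.
month_of_year <- month %% 12          # 0, 1, ..., 11, 0, 1, ...
mean_by_month <- tapply(temp, month_of_year, mean)

print(r_linear)
print(round(mean_by_month, 1))
```

The correlation coefficient only measures linear association, so a strictly periodic relationship scores close to zero even though temperature is fully determined by the month of the year in this toy series.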

Understanding Noise: Age to Age Months

Notes: age_with_months = age in years plus the months since the birth month, expressed as a fraction of a year.

pf$age_with_months <- pf$age + (12- pf$dob_month) / 12
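
A quick worked example of the same formula, using three hypothetical users who share an age in years but were born in different months (the values are made up for illustration):

```r
# Hypothetical users: same age in years, different birth months.
age       <- c(36, 36, 36)
dob_month <- c(1, 6, 12)   # January, June, December

# Months since the birth month, expressed as a fraction of a year.
age_with_months <- age + (12 - dob_month) / 12
print(age_with_months)
```

The January birthday gets the largest fractional part (11/12) and the December birthday gets none, so users within the same year of age are now spread over twelve distinct values.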

Age with Months Means

age_groups_with_months <- group_by(pf, age_with_months)
pf.fc_by_age_months <- summarise(age_groups_with_months,
                                  friend_count_mean = mean(friend_count),
                                  friend_count_median = median(friend_count),
                                  n = n())
pf.fc_by_age_months <- arrange(pf.fc_by_age_months, age_with_months)

Programming Assignment


Noise in Conditional Means

ggplot(aes(x=age_with_months, y= friend_count_mean), data = subset(pf.fc_by_age_months, age_with_months < 71)) + geom_line()


Smoothing Conditional Means

Notes: If we increase the bin width, we get a smoother line in our graph, but we lose detail that we could otherwise see. geom_smooth() offers another option: it overlays a smoothed line on the graph while keeping the finer-grained line visible.

p1 <- ggplot(aes(x=age, y= friend_count_mean), data = subset(pf.fc_by_age, age < 71)) + geom_line() + geom_smooth()

p2 <- ggplot(aes(x=age_with_months, y= friend_count_mean), data = subset(pf.fc_by_age_months, age_with_months < 71)) + geom_line() + geom_smooth()

p3 <- ggplot(aes(x= round(age / 5) * 5, y= friend_count), data = subset(pf, age < 71)) + geom_line(stat = 'summary', fun.y=mean)

library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
grid.arrange(p2,p1,p3, ncol= 1)
## `geom_smooth()` using method = 'loess'
## `geom_smooth()` using method = 'loess'
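
The expression round(age / 5) * 5 used in p3 snaps each age to the nearest multiple of 5. Here is a minimal base R sketch on made-up ages and friend counts (both vectors are assumptions) showing the binning and the per-bin means.

```r
ages <- c(13, 14, 17, 18, 22)               # hypothetical ages
friend_count <- c(100, 200, 300, 400, 500)  # hypothetical counts

# Snap each age to the nearest multiple of 5.
age_bin <- round(ages / 5) * 5

# Mean friend count per 5-year bin: fewer, smoother points than per-year means.
bin_means <- tapply(friend_count, age_bin, mean)
print(age_bin)
print(bin_means)
```

Five distinct ages collapse into two bins, which is the trade-off noted above: each mean is based on more users and is less noisy, but differences within a bin are no longer visible.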


Which Plot to Choose?

Notes: There is no single best graph to choose. Each graph reveals something different about the same data.


Analyzing Two Variables

Reflection: This lesson covered a lot of tools, such as gridExtra, ggplot2, and correlation coefficients.


Click KnitHTML to see all of your hard work and to have an html page of this lesson, your answers, and your notes!

install.packages('knitr', dependencies = TRUE)